1 Executive Summary

In an effort to refresh my predictive modeling skills, I entered an ongoing Kaggle Competition where competitors attempt to predict house prices in Ames, Iowa.

I started this competition by just focusing on getting a good understanding of the dataset. The EDA is detailed and many visualizations are included.

Feature engineering was performed on a handful of variables, which greatly improved model performance.

The XGBoost model ended up performing very well with a cross validation RMSE of 0.1177.

2 Introduction

Kaggle describes this competition as follows:

Ask a home buyer to describe their dream house, and they probably won’t begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition’s dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

3 Loading data and Packages

# Read packages
library(ggplot2)
library(readr)
library(dplyr)
library(ggrepel)
library(scales)
library(knitr)
library(corrplot)
library(plyr)
library(randomForest)
library(gridExtra)
library(psych)
library(caret)
library(xgboost)
test <- read.csv("test.csv", stringsAsFactors = F)
train <- read.csv("train.csv", stringsAsFactors = F)


#Saving IDs in a vector for later
test_labels <- test$Id
test$Id <- NULL
train$Id <- NULL

test$SalePrice <- NA
all <- rbind(train, test)

The dataset consists of characters and integers. Most of the character variables are actually factors, but I read them in as character strings because most of them require cleaning and/or feature engineering.

4 Categorizing data types

numericVars <- which(sapply(all, is.numeric)) #index vector numeric variables
numericVarNames <- names(numericVars) #saving names vector for use later on

##str(numericVarNames)
##cat('There are', length(numericVars), 'numeric variables')

5 Exploratory Data Analysis

5.1 SalePrice

ggplot(data=all[!is.na(all$SalePrice),], aes(x=SalePrice)) +
        geom_histogram(fill="blue", binwidth = 10000) +
        scale_x_continuous(breaks= seq(0, 800000, by = 100000), labels = comma)

summary(all$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   34900  129975  163000  180921  214000  755000    1459

SalePrice is very right skewed. This was expected as very few people can afford very expensive houses. This will be kept in mind, and will be dealt with before modeling.

5.2 Most Important Numeric Predictors

numeric.vars <- which(sapply(all, is.numeric)) #index vector numeric variables
numeric.var.names <- names(numeric.vars) #saving names vector for use later on

all.num.var <- all[, numeric.vars]
cor.num.var <- cor(all.num.var, use="pairwise.complete.obs") #correlations of all numeric variables

cor.sorted <- as.matrix(sort(cor.num.var[,'SalePrice'], decreasing = TRUE)) #sort on decreasing correlations with SalePrice

#select only high corelations
CorHigh <- names(which(apply(cor.sorted, 1, function(x) abs(x)>0.5)))
cor.num.var <- cor.num.var[CorHigh, CorHigh]

corrplot.mixed(cor.num.var, tl.col="black", tl.pos = "lt", number.cex=0.75)

The correlation plot shows that 10 variables have a correlation above 0.5 with SalePrice. We also see that there is some multicollinearity going on. GarageArea has a 0.89 correlation with GarageCars. X1stFlrSF has a 0.80 correlation with TotalBsmtSF. TotRmsAbvGrd has a 0.81 correlation with GrLivArea.

5.3 Overall Quality

The variable Overall Quality has the highest correlation with SalePrice. Overall quality is defined as “The overall material and finish of the house,” and provides a ranking between 1 through 10.

ggplot(data=all[!is.na(all$SalePrice),], aes(x=factor(OverallQual), y=SalePrice))+
        geom_boxplot(col='blue') + labs(x='Overall Quality') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

As the Overall Quality of the house increases, Sale Price increases.

5.4 Above Ground Living Area

ggplot(data=all[!is.na(all$SalePrice),], aes(x=GrLivArea, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
## `geom_smooth()` using formula = 'y ~ x'

6 Factorization and fixing NA’s

6.1 Completeness of the data

First, let’s check and see which variables have missing values.

NAcol <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
##       PoolQC  MiscFeature        Alley        Fence    SalePrice  FireplaceQu 
##         2909         2814         2721         2348         1459         1420 
##  LotFrontage  GarageYrBlt GarageFinish   GarageQual   GarageCond   GarageType 
##          486          159          159          159          159          157 
##     BsmtCond BsmtExposure     BsmtQual BsmtFinType2 BsmtFinType1   MasVnrType 
##           82           82           81           80           79           24 
##   MasVnrArea     MSZoning    Utilities BsmtFullBath BsmtHalfBath   Functional 
##           23            4            2            2            2            2 
##  Exterior1st  Exterior2nd   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF  TotalBsmtSF 
##            1            1            1            1            1            1 
##   Electrical  KitchenQual   GarageCars   GarageArea     SaleType 
##            1            1            1            1            1
cat('There are', length(NAcol), 'columns with missing values')
## There are 35 columns with missing values

The 1459 NAs in SalePrice match the size of the test set perfectly. This means that NA’s need to be fixed in 34 predictor variables.

6.2 Imputing missing data and dealing with character variables

Here I will take a look at the 34 variables that contain missing values. I will begin with the variables that has the most NA’s and work my way down. If I come across a variable that actually forms a group with other variables, I will deal with them at the same time. For example, Pool, Garage and Basement all have more than one variable relating to the area.

I decided to use knitr’s “Tabs” ability to keep the document more readable. If you don’t want to read every section, the Garage and Basement sections are especially interesting.

Along with fixing NA’s, I have also converted character variables into ordinal integers and factors. I will later convert the factors into numeric variables by creating dummy variables.

6.2.1 Pool Variables

Pool Quality and the PoolArea variable

PoolQC: Pool quality

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   NA   No Pool
   

NA means No Pool. Easy enough. The high number of NA’s makes sense as not many houses have pools.

all$PoolQC[is.na(all$PoolQC)] <- 'None'

It is also clear that this variable ordinal and can be label encoded. Because there are multiple variables that use the same quality levels of 0-5, I am going to create a vector that I can reuse later on.

Qualities <- c('None' = 0, 'Po' = 1, 'Fa' = 2, 'TA' = 3, 'Gd' = 4, 'Ex' = 5)

Now, we can use the revalue function to assign a number to each string

all$PoolQC<-as.integer(revalue(all$PoolQC, Qualities))
table(all$PoolQC)
## 
##    0    2    4    5 
## 2909    2    4    4

Only ten houses have pools.

However, there is a second variable that relates to Pools. This is the PoolArea variable (in square feet). There are 3 houses that have a PoolArea but no PoolQC. First, I checked if there was a clear relation between the PoolArea and the PoolQC. As I did not see a clear relation (bigger or smaller pools with better PoolQC), I am going to impute PoolQC values based on the Overall Quality of the houses (which is not very high for those 3 houses).

all[all$PoolArea>0 & all$PoolQC==0, c('PoolArea', 'PoolQC', 'OverallQual')]
##      PoolArea PoolQC OverallQual
## 2421      368      0           4
## 2504      444      0           6
## 2600      561      0           3
all$PoolQC[2421] <- 2
all$PoolQC[2504] <- 3
all$PoolQC[2600] <- 2

Please return to the Tabs menu to view work on other variables

6.2.2 MiscFeature

MiscFeature has 2814 NAs.

MiscFeature: Miscellaneous feature not covered in other categories

   Elev Elevator
   Gar2 2nd Garage (if not described in garage section)
   Othr Other
   Shed Shed (over 100 SF)
   TenC Tennis Court
   NA   None

This is a clear case of a factor, where NA means None.

all$MiscFeature[is.na(all$MiscFeature)] <- 'None'
all$MiscFeature <- as.factor(all$MiscFeature)

ggplot(all[!is.na(all$SalePrice),], aes(x=MiscFeature, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

table(all$MiscFeature)
## 
## Gar2 None Othr Shed TenC 
##    5 2814    4   95    1

Interestingly, properties with large sheds sold for less money than properties with no extra features. Maybe having a shed means the property doesn’t have a garage in many cases? And with there only being one house that has a tennis court, this variable isn’t particularly useful.

Please return to the Tabs menu to view work on other variables

6.2.3 Alley

Alley: Type of alley access to property

   Grvl Gravel
   Pave Paved
   NA   No alley access

Alley is a factor variable. Most houses do not have alleys.

all$Alley[is.na(all$Alley)] <- 'None'
all$Alley <- as.factor(all$Alley)

ggplot(all[!is.na(all$SalePrice),], aes(x=Alley, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

table(all$MiscFeature)
## 
## Gar2 None Othr Shed TenC 
##    5 2814    4   95    1

Ninety one properties are connected to an Alley, and, as expected, houses on paved alleys are worth more than houses on gravel.

6.2.4 Fence

Fence: Fence quality

   GdPrv    Good Privacy
   MnPrv    Minimum Privacy
   GdWo Good Wood
   MnWw Minimum Wood/Wire
   NA   No Fence
   

This is a true factor fariable as the categories are not ordinal. It seems like this should’ve been two different variables, as two categories could be true at one time. A fence could have both good privacy and good wood.

all$Fence[is.na(all$Fence)] <- 'None'

ggplot(all[!is.na(all$SalePrice),], aes(x=Fence, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

all$Fence <- as.factor(all$Fence)

6.2.5 FireplaceQu

FireplaceQu: Fireplace quality

   Ex   Excellent - Exceptional Masonry Fireplace
   Gd   Good - Masonry Fireplace in main level
   TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   Fa   Fair - Prefabricated Fireplace in basement
   Po   Poor - Ben Franklin Stove
   NA   No Fireplace
   

Ordinal factor regarding the quality of the property’s fireplace.

all$FireplaceQu[is.na(all$FireplaceQu)] <- 'None'
all$FireplaceQu<-as.integer(revalue(all$FireplaceQu, Qualities))
table(all$FireplaceQu)
## 
##    0    1    2    3    4    5 
## 1420   46   74  592  744   43
ggplot(all[!is.na(all$SalePrice),], aes(x=FireplaceQu, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

Having no fireplace is slightly better than having a low quality fireplace. The properties with the highest quality fireplaces have incredible value, more than $100,000 higher than the next level of fireplace quality.

6.2.6 Lot Variables

There are three variables related to lots. LotFrontage has NAs and LotShape and LotConfig are complete.

LotFrontage: Linear feet of street connected to property

There are almost 500 NAs. It is unreasonale for a house to have 0 feet of street property, and these values need to be imputed. I am going to impute these values by taking the median value of other houses in the same neighborhood.

ggplot(all[!is.na(all$LotFrontage),], aes(x=as.factor(Neighborhood), y=LotFrontage)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))
## Warning in geom_bar(stat = "summary", fun.y = "median", fill = "blue"):
## Ignoring unknown parameters: `fun.y`
## No summary function supplied, defaulting to `mean_se()`

for (i in 1:nrow(all)){
        if(is.na(all$LotFrontage[i])){
               all$LotFrontage[i] <- as.integer(median(all$LotFrontage[all$Neighborhood==all$Neighborhood[i]], na.rm=TRUE)) 
        }
}

LotShape: General shape of property

   Reg  Regular 
   IR1  Slightly irregular
   IR2  Moderately Irregular
   IR3  Irregular
   

No NAs. The values seem ordinal.

all$LotShape<-as.integer(revalue(all$LotShape, c('IR3'=0, 'IR2'=1, 'IR1'=2, 'Reg'=3)))
table(all$LotShape)
## 
##    0    1    2    3 
##   16   76  968 1859
sum(table(all$LotShape))
## [1] 2919

LotConfig: Lot configuration

   Inside   Inside lot
   Corner   Corner lot
   CulDSac  Cul-de-sac
   FR2  Frontage on 2 sides of property
   FR3  Frontage on 3 sides of property
   

No NAs. The values are not ordinal so I converted the variable to a factor.

all$LotConfig <- as.factor(all$LotConfig)
table(all$LotConfig)
## 
##  Corner CulDSac     FR2     FR3  Inside 
##     511     176      85      14    2133
sum(table(all$LotConfig))
## [1] 2919

6.2.7 Garages

There are 7 garage variables: GarageYrBlt, GarageFinish, GarageQual, GarageCond, GarageType, GarageCars, and GarageArea.

Two of those have one NA (GarageCars and GarageArea), one has 157 NAs (GarageType), and the other four have 159 NAs.

I am going to replace all NAs with the year the house was built.

all$GarageYrBlt[is.na(all$GarageYrBlt)] <- all$YearBuilt[is.na(all$GarageYrBlt)]

As NAs mean ‘No Garage’ for character variables, I now want to find out where the differences between the 157 NA GarageType and the other 3 character variables with 159 NAs come from.

#check if all 157 NAs are the same observations among the variables with 157/159 NAs
length(which(is.na(all$GarageType) & is.na(all$GarageFinish) & is.na(all$GarageCond) & is.na(all$GarageQual)))
## [1] 157
#Find the 2 additional NAs
kable(all[!is.na(all$GarageType) & is.na(all$GarageFinish), c('GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])
GarageCars GarageArea GarageType GarageCond GarageQual GarageFinish
2127 1 360 Detchd NA NA NA
2577 NA NA Detchd NA NA NA

The 157 NAs within GarageType all turn out to be NA in GarageCondition, GarageQuality, and GarageFinish as well. The differences are found in houses 2127 and 2577. As you can see, house 2127 actually does seem to have a Garage and house 2577 does not. Therefore, there should be 158 houses without a Garage. To fix house 2127, I will impute the most common values for GarageCond, GarageQual, and GarageFinish.

#Imputing modes.
all$GarageCond[2127] <- names(sort(-table(all$GarageCond)))[1]
all$GarageQual[2127] <- names(sort(-table(all$GarageQual)))[1]
all$GarageFinish[2127] <- names(sort(-table(all$GarageFinish)))[1]

#display "fixed" house
kable(all[2127, c('GarageYrBlt', 'GarageCars', 'GarageArea', 'GarageType', 'GarageCond', 'GarageQual', 'GarageFinish')])
GarageYrBlt GarageCars GarageArea GarageType GarageCond GarageQual GarageFinish
2127 1910 1 360 Detchd TA TA Unf

GarageCars and GarageArea: The number of cars that can fit in the garage and the area in Square Feet

Both have 1 NA. As you can see above, it is house 2577 for both variables. The problem probably occured as the GarageType for this house is “detached”, while all other Garage-variables seem to indicate that this house has no Garage.

#fixing 3 values for house 2577
all$GarageCars[2577] <- 0
all$GarageArea[2577] <- 0
all$GarageType[2577] <- NA

#check if NAs of the character variables are now all 158
length(which(is.na(all$GarageType) & is.na(all$GarageFinish) & is.na(all$GarageCond) & is.na(all$GarageQual)))
## [1] 158

Now, the 4 character variables related to garage all have the same set of 158 NAs, which correspond to ‘No Garage’. I will fix all of them in the remainder of this section

GarageType: Garage location

The values do not seem ordinal, so I will convert into a factor.

   2Types   More than one type of garage
   Attchd   Attached to home
   Basment  Basement Garage
   BuiltIn  Built-In (Garage part of house - typically has room above garage)
   CarPort  Car Port
   Detchd   Detached from home
   NA   No Garage
all$GarageType[is.na(all$GarageType)] <- 'No Garage'
all$GarageType <- as.factor(all$GarageType)
table(all$GarageType)
## 
##    2Types    Attchd   Basment   BuiltIn   CarPort    Detchd No Garage 
##        23      1723        36       186        15       778       158

GarageFinish: Interior finish of the garage

GarageFinish: Interior finish of the garage

   Fin  Finished
   RFn  Rough Finished  
   Unf  Unfinished
   NA   No Garage

These values are ordinal.

all$GarageFinish[is.na(all$GarageFinish)] <- 'NA'
all$GarageFinish<-as.integer(revalue(all$GarageFinish, c('NA'=0, 'Unf'=1, 'RFn'=2, 'Fin'=3)))
table(all$GarageFinish)
## 
##    0    1    2    3 
##  158 1231  811  719
sum(table(all$GarageFinish))
## [1] 2919
ggplot(all[!is.na(all$GarageFinish),], aes(x=GarageFinish, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

Houses with no garages are worth more than houses with unfinished garages.

GarageQual: Garage quality

GarageQual: Garage quality

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage

Ordinal.

all$GarageQual[is.na(all$GarageQual)] <- 'None'
all$GarageQual<-as.integer(revalue(all$GarageQual, Qualities))
table(all$GarageQual)
## 
##    0    1    2    3    4    5 
##  158    5  124 2605   24    3
ggplot(all, aes(x=GarageQual, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

Only three garages received a top rating of Excellent, and the houses are not very valuable.

GarageCond: Garage condition

GarageCond: Garage condition

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor
   NA   No Garage
   

Ordinal. This variable looks to be almost the same as GarageQual. The data dictionary does not mention any distinction.

all$GarageCond[is.na(all$GarageCond)] <- 'None'
all$GarageCond<-as.integer(revalue(all$GarageCond, Qualities))
table(all$GarageCond)
## 
##    0    1    2    3    4    5 
##  158   14   74 2655   15    3
ggplot(all, aes(x=GarageCond, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

6.2.8 Basements

There are 11 variables that related to the basement of the house. Five of them have 79-82 NAs, and six have one NA each.

#check if all 79 NAs are the same observations among the variables with 80+ NAs
length(which(is.na(all$BsmtQual) & is.na(all$BsmtCond) & is.na(all$BsmtExposure) & is.na(all$BsmtFinType1) & is.na(all$BsmtFinType2)))
## [1] 79
#Find the additional NAs; BsmtFinType1 is the one with 79 NAs
all[!is.na(all$BsmtFinType1) & (is.na(all$BsmtCond)|is.na(all$BsmtQual)|is.na(all$BsmtExposure)|is.na(all$BsmtFinType2)), c('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2')]
##      BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinType2
## 333        Gd       TA           No          GLQ         <NA>
## 949        Gd       TA         <NA>          Unf          Unf
## 1488       Gd       TA         <NA>          Unf          Unf
## 2041       Gd     <NA>           Mn          GLQ          Rec
## 2186       TA     <NA>           No          BLQ          Unf
## 2218     <NA>       Fa           No          Unf          Unf
## 2219     <NA>       TA           No          Unf          Unf
## 2349       Gd       TA         <NA>          Unf          Unf
## 2525       TA     <NA>           Av          ALQ          Unf

So altogether, it seems as if there are 79 houses without a basement, because the basement variables of the other houses with missing values are all 80% complete (missing 1 out of 5 values). I am going to impute the modes to fix the other 9 houses that only have one NA value each.

#Imputing modes.
all$BsmtFinType2[333] <- names(sort(-table(all$BsmtFinType2)))[1]
all$BsmtExposure[c(949, 1488, 2349)] <- names(sort(-table(all$BsmtExposure)))[1]
all$BsmtCond[c(2041, 2186, 2525)] <- names(sort(-table(all$BsmtCond)))[1]
all$BsmtQual[c(2218, 2219)] <- names(sort(-table(all$BsmtQual)))[1]

Now that the 5 variables considered agree upon 79 houses with ‘no basement’, I am going to factorize/hot encode them below.

BsmtQual: Evaluates the height of the basement

   Ex   Excellent (100+ inches) 
   Gd   Good (90-99 inches)
   TA   Typical (80-89 inches)
   Fa   Fair (70-79 inches)
   Po   Poor (<70 inches
   NA   No Basement

An ordinal variable so we can make use of the Qualities vector.

all$BsmtQual[is.na(all$BsmtQual)] <- 'None'
all$BsmtQual<-as.integer(revalue(all$BsmtQual, Qualities))
table(all$BsmtQual)
## 
##    0    2    3    4    5 
##   79   88 1285 1209  258

BsmtCond: Evaluates the general condition of the basement

   Ex   Excellent
   Gd   Good
   TA   Typical - slight dampness allowed
   Fa   Fair - dampness or some cracking or settling
   Po   Poor - Severe cracking, settling, or wetness
   NA   No Basement

An ordinal variable so we can make use of the Qualities vector.

all$BsmtCond[is.na(all$BsmtCond)] <- 'None'
all$BsmtCond<-as.integer(revalue(all$BsmtCond, Qualities))
table(all$BsmtCond)
## 
##    0    1    2    3    4 
##   79    5  104 2609  122

BsmtExposure: Refers to walkout or garden level walls

   Gd   Good Exposure
   Av   Average Exposure (split levels or foyers typically score average or above)  
   Mn   Mimimum Exposure
   No   No Exposure
   NA   No Basement
   

An ordinal variable.

all$BsmtExposure[is.na(all$BsmtExposure)] <- 'None'
Exposure <- c('None'=0, 'No'=1, 'Mn'=2, 'Av'=3, 'Gd'=4)

all$BsmtExposure<-as.integer(revalue(all$BsmtExposure, Exposure))
table(all$BsmtExposure)
## 
##    0    1    2    3    4 
##   79 1907  239  418  276

BsmtFinType1: Rating of basement finished area

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinshed
   NA   No Basement
   

Another ordinal variable.

all$BsmtFinType1[is.na(all$BsmtFinType1)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)

all$BsmtFinType1<-as.integer(revalue(all$BsmtFinType1, FinType))
table(all$BsmtFinType1)
## 
##   0   1   2   3   4   5   6 
##  79 851 154 288 269 429 849

BsmtFinType2: Rating of basement finished area (if multiple types)

   GLQ  Good Living Quarters
   ALQ  Average Living Quarters
   BLQ  Below Average Living Quarters   
   Rec  Average Rec Room
   LwQ  Low Quality
   Unf  Unfinshed
   NA   No Basement
   

Another ordinal variable.

all$BsmtFinType2[is.na(all$BsmtFinType2)] <- 'None'
FinType <- c('None'=0, 'Unf'=1, 'LwQ'=2, 'Rec'=3, 'BLQ'=4, 'ALQ'=5, 'GLQ'=6)

all$BsmtFinType2<-as.integer(revalue(all$BsmtFinType2, FinType))
table(all$BsmtFinType2)
## 
##    0    1    2    3    4    5    6 
##   79 2494   87  105   68   52   34

Remaining Basement variabes with just a few NAs

I now still have to deal with those 6 variables that have 1 or 2 NAs.

#display remaining NAs. Using BsmtQual as a reference for the 79 houses without basement agreed upon earlier
all[(is.na(all$BsmtFullBath)|is.na(all$BsmtHalfBath)|is.na(all$BsmtFinSF1)|is.na(all$BsmtFinSF2)|is.na(all$BsmtUnfSF)|is.na(all$TotalBsmtSF)), c('BsmtQual', 'BsmtFullBath', 'BsmtHalfBath', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF')]
##      BsmtQual BsmtFullBath BsmtHalfBath BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 2121        0           NA           NA         NA         NA        NA
## 2189        0           NA           NA          0          0         0
##      TotalBsmtSF
## 2121          NA
## 2189           0

It should be obvious that those remaining NAs all refer to ‘not present’. Below, I am fixing those remaining variables.

BsmtFullBath: Basement full bathrooms

An integer variable.

all$BsmtFullBath[is.na(all$BsmtFullBath)] <-0
table(all$BsmtFullBath)
## 
##    0    1    2    3 
## 1707 1172   38    2

BsmtHalfBath: Basement half bathrooms

An integer variable.

all$BsmtHalfBath[is.na(all$BsmtHalfBath)] <-0
table(all$BsmtHalfBath)
## 
##    0    1    2 
## 2744  171    4

BsmtFinSF1: Type 1 finished square feet

An integer variable.

all$BsmtFinSF1[is.na(all$BsmtFinSF1)] <-0

BsmtFinSF2: Type 2 finished square feet

An integer variable.

all$BsmtFinSF2[is.na(all$BsmtFinSF2)] <-0

BsmtUnfSF: Unfinished square feet of basement area

An integer variable.

all$BsmtUnfSF[is.na(all$BsmtUnfSF)] <-0

TotalBsmtSF: Total square feet of basement area

An integer variable.

all$TotalBsmtSF[is.na(all$TotalBsmtSF)] <-0

6.2.9 Masonry Veneer

MasVnrType has 24 NAs and MasVnrArea has 23 NAs.

length(which(is.na(all$MasVnrType) & is.na(all$MasVnrArea)))
## [1] 23

All 23 of MasVnrArea have NAs. I will deal with these first and then look into the one leftover MasVnrType.

MasVnrArea: Masonry veneer area in square feet

An NA in MasVnrArea implies that the house just doesn’t have a masonry veneer. I will impute the missing values with zero.

all$MasVnrArea[is.na(all$MasVnrArea)] <- 0

MasVnrType: Masonry veneer type

   BrkCmn   Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   None None
   Stone    Stone

Not ordinal. Will convert into factor.

But first, let’s find the house that has a veneer area but no veneer type.

kable(all[all$MasVnrArea != 0 & is.na(all$MasVnrType), c('MasVnrArea', 'MasVnrType')])
MasVnrArea MasVnrType
2611 198 NA

Since we have no way of knowing which type of masonry this house had, let’s impute with the most common.

all$MasVnrType[2611] <- names(sort(-table(all$MasVnrType)))[2] #taking the 2nd value as the 1st is 'none'

Because MasVnrType is not ordinal it can be converted into a factor.

all$MasVnrType[is.na(all$MasVnrType)] <- 'None'
table(all$MasVnrType)
## 
##  BrkCmn BrkFace    None   Stone 
##      25     880    1765     249
all$MasVnrType <- as.factor(all$MasVnrType)

ggplot(all, aes(x=MasVnrType, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

6.2.10 MSZoning

MSZoning: Identifies the general zoning classification of the sale.

   A    Agriculture
   C    Commercial
   FV   Floating Village Residential
   I    Industrial
   RH   Residential High Density
   RL   Residential Low Density
   RP   Residential Low Density Park 
   RM   Residential Medium Density

A factor variable which has four NAs.

#imputing the mode
all$MSZoning[is.na(all$MSZoning)] <- names(sort(-table(all$MSZoning)))[1]
all$MSZoning <- as.factor(all$MSZoning)
table(all$MSZoning)
## 
## C (all)      FV      RH      RL      RM 
##      25     139      26    2269     460
sum(table(all$MSZoning))
## [1] 2919

6.2.11 Utilities

Utilities: Type of utilities available

   AllPub   All public Utilities (E,G,W,& S)    
   NoSewr   Electricity, Gas, and Water (Septic Tank)
   NoSeWa   Electricity and Gas Only
   ELO  Electricity only

Factor variable with two NAs.

#imputing the mode
all$Utilities[is.na(all$Utilities)] <- names(sort(-table(all$Utilities)))[1]
all$Utilities <- as.factor(all$Utilities)
table(all$Utilities)
## 
## AllPub NoSeWa 
##   2918      1
sum(table(all$Utilities))
## [1] 2919

All houses but one have all public utilities.

6.2.12 Functional

Functional: Home functionality (Assume typical unless deductions are warranted)

   Typ  Typical Functionality
   Min1 Minor Deductions 1
   Min2 Minor Deductions 2
   Mod  Moderate Deductions
   Maj1 Major Deductions 1
   Maj2 Major Deductions 2
   Sev  Severely Damaged
   Sal  Salvage only
   

An ordinal factor with two NAs.

I will impute the NAs with the most common functionality, Typ.

all$Functional[is.na(all$Functional)]<-names(sort(-table(all$Functional)))[1]
functional <- c('Sal'=0, 'Sev'=1, 'Maj2'=2, 'Maj1'=3, 'Mod'=4, 'Min2'=5, 'Min1'=6, 'Typ'=7)
all$Functional <- as.integer(revalue(all$Functional, functional))
## The following `from` values were not present in `x`: Sal
table(all$Functional)
## 
##    1    2    3    4    5    6    7 
##    2    9   19   35   70   65 2719
sum(table(all$Functional))
## [1] 2919

6.2.13 Exterior Variables

Exterior1st

Exterior1st: Exterior covering on house

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast 
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

Exterior2nd

Exterior2nd: Exterior covering on house (if more than one material)

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles
   

Both variables are factor variables with only one NA. I will simply impute with the mode.

#1st
all$Exterior1st[is.na(all$Exterior1st)]<-names(sort(-table(all$Exterior1st)))[1]
table(all$Exterior1st)
## 
## AsbShng AsphShn BrkComm BrkFace  CBlock CemntBd HdBoard ImStucc MetalSd Plywood 
##      44       2       6      87       2     126     442       1     450     221 
##   Stone  Stucco VinylSd Wd Sdng WdShing 
##       2      43    1026     411      56
sum(table(all$Exterior1st))
## [1] 2919
#2nd
all$Exterior2nd[is.na(all$Exterior2nd)]<-names(sort(-table(all$Exterior2nd)))[1]
table(all$Exterior2nd)
## 
## AsbShng AsphShn Brk Cmn BrkFace  CBlock CmentBd HdBoard ImStucc MetalSd   Other 
##      38       4      22      47       3     126     406      15     447       1 
## Plywood   Stone  Stucco VinylSd Wd Sdng Wd Shng 
##     270       6      47    1015     391      81
sum(table(all$Exterior2nd))
## [1] 2919

6.2.14 Electrical

Electrical: Electrical system

   SBrkr    Standard Circuit Breakers & Romex
   FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
   FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
   FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
   Mix  Mixed

Almost ordinal, but the mixed option throws it off. Only one NA so I will impute the mode.

all$Electrical[is.na(all$Electrical)]<-names(sort(-table(all$Electrical)))[1]
all$Electrical<-as.factor(all$Electrical)
table(all$Electrical)
## 
## FuseA FuseF FuseP   Mix SBrkr 
##   188    50     8     1  2672
sum(table(all$Electrical))
## [1] 2919

6.2.15 KitchenQual

KitchenQual: Kitchen quality

   Ex   Excellent
   Gd   Good
   TA   Typical/Average
   Fa   Fair
   Po   Poor

One NA. Will impute with the mode. Ordinal so we can use the qualities vector.

all$KitchenQual[is.na(all$KitchenQual)]<-names(sort(-table(all$KitchenQual)))[1]
all$KitchenQual<-as.integer(revalue(all$KitchenQual, Qualities))
## The following `from` values were not present in `x`: None, Po
table(all$KitchenQual)
## 
##    2    3    4    5 
##   70 1493 1151  205
sum(table(all$KitchenQual))
## [1] 2919

6.2.16 SaleType

SaleType: Type of sale

   WD   Warranty Deed - Conventional
   CWD  Warranty Deed - Cash
   VWD  Warranty Deed - VA Loan
   New  Home just constructed and sold
   COD  Court Officer Deed/Estate
   Con  Contract 15% Down payment regular terms
   ConLw    Contract Low Down payment and low interest
   ConLI    Contract Low Interest
   ConLD    Contract Low Down
   Oth  Other
   

Factor variable with one NA. I will impute with the mode.

all$SaleType[is.na(all$SaleType)]<-names(sort(-table(all$SaleType)))[1]
all$SaleType<-as.factor(all$SaleType)
table(all$SaleType)
## 
##   COD   Con ConLD ConLI ConLw   CWD   New   Oth    WD 
##    87     5    26     9     8    12   239     7  2526
sum(table(all$SaleType))
## [1] 2919

6.2.17 NA Check

I will run one final check to make sure we have fixed all NA variables.

NAcol <- which(colSums(is.na(all)) > 0)
sort(colSums(sapply(all[NAcol], is.na)), decreasing = TRUE)
## SalePrice 
##      1459

Now the only column with missing values is SalePrice, which we will be predicting.

6.3 Categorizing the remaining characted variables

Now that all of the NAs have been addressed, a few variables are still improperly categorized. Character variables need to be changed to factors or ordinal factors and a few numeric variables are actually factors as well. I’ll fix the character variables now. Similar to the NA section I have created a Tabset section that you can flip through.

Charcol <- names(all[,sapply(all, is.character)])
Charcol
##  [1] "Street"        "LandContour"   "LandSlope"     "Neighborhood" 
##  [5] "Condition1"    "Condition2"    "BldgType"      "HouseStyle"   
##  [9] "RoofStyle"     "RoofMatl"      "Exterior1st"   "Exterior2nd"  
## [13] "ExterQual"     "ExterCond"     "Foundation"    "Heating"      
## [17] "HeatingQC"     "CentralAir"    "PavedDrive"    "SaleCondition"
cat('There are', length(Charcol), 'remaining columns with character values')
## There are 20 remaining columns with character values

6.3.1 Street and Driveway

Street Street: Type of road access to property

   Grvl Gravel  
   Pave Paved

Ordinal factor.

all$Street <- as.integer(revalue(all$Street, c('Grvl'=0, 'Pave'=1)))
table(all$Street)
## 
##    0    1 
##   12 2907
ggplot(all[!is.na(all$SalePrice),], aes(x=Street, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

PavedDrive PavedDrive: Paved driveway

   Y    Paved 
   P    Partial Pavement
   N    Dirt/Gravel

Ordinal Factor.

all$PavedDrive <- as.integer(revalue(all$PavedDrive, c('N'=0, 'P'=1, 'Y'=2)))
table(all$PavedDrive)
## 
##    0    1    2 
##  216   62 2641
ggplot(all[!is.na(all$SalePrice),], aes(x=PavedDrive, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.2 Land

LandContour LandContour: Flatness of the property

   Lvl  Near Flat/Level 
   Bnk  Banked - Quick and significant rise from street grade to building
   HLS  Hillside - Significant slope from side to side
   Low  Depression
   

Factor. I checked for ordinality but level properties are actually worth less than hillside and depression.

all$LandContour <- as.factor(all$LandContour)

ggplot(all[!is.na(all$SalePrice),], aes(x=LandContour, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

LandSlope LandSlope: Slope of property

   Gtl  Gentle slope
   Mod  Moderate Slope  
   Sev  Severe Slope

Factor.

all$LandSlope <- as.factor(all$LandSlope)
table(all$LandSlope)
## 
##  Gtl  Mod  Sev 
## 2778  125   16
ggplot(all[!is.na(all$SalePrice),], aes(x=LandSlope, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.3 Neighborhood

Neighborhood: Physical locations within Ames city limits

   Blmngtn  Bloomington Heights
   Blueste  Bluestem
   BrDale   Briardale
   BrkSide  Brookside
   ClearCr  Clear Creek
   CollgCr  College Creek
   Crawfor  Crawford
   Edwards  Edwards
   Gilbert  Gilbert
   IDOTRR   Iowa DOT and Rail Road
   MeadowV  Meadow Village
   Mitchel  Mitchell
   Names    North Ames
   NoRidge  Northridge
   NPkVill  Northpark Villa
   NridgHt  Northridge Heights
   NWAmes   Northwest Ames
   OldTown  Old Town
   SWISU    South & West of Iowa State University
   Sawyer   Sawyer
   SawyerW  Sawyer West
   Somerst  Somerset
   StoneBr  Stone Brook
   Timber   Timberland
   Veenker  Veenker
   

Factor.

all$Neighborhood <- as.factor(all$Neighborhood)

ggplot(all[!is.na(all$SalePrice),], aes(x=Neighborhood, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
        theme(axis.text.x = element_text(angle = -50, hjust = 0))

6.3.4 Conditions

Condition1 Condition1: Proximity to various conditions

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm Normal  
   RRNn Within 200' of North-South Railroad
   RRAn Adjacent to North-South Railroad
   PosN Near positive off-site feature--park, greenbelt, etc.
   PosA Adjacent to postive off-site feature
   RRNe Within 200' of East-West Railroad
   RRAe Adjacent to East-West Railroad

Factor.

all$Condition1 <- as.factor(all$Condition1)

ggplot(all[!is.na(all$SalePrice),], aes(x=Condition1, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
        theme(axis.text.x = element_text(angle = -40, hjust = 0))

Condition2 Condition2: Proximity to various conditions (if more than one is present)

   Artery   Adjacent to arterial street
   Feedr    Adjacent to feeder street   
   Norm Normal  
   RRNn Within 200' of North-South Railroad
   RRAn Adjacent to North-South Railroad
   PosN Near positive off-site feature--park, greenbelt, etc.
   PosA Adjacent to postive off-site feature
   RRNe Within 200' of East-West Railroad
   RRAe Adjacent to East-West Railroad
all$Condition2 <- as.factor(all$Condition2)

ggplot(all[!is.na(all$SalePrice),], aes(x=Condition2, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.5 BldgType

BldgType: Type of dwelling

   1Fam Single-family Detached  
   2FmCon   Two-family Conversion; originally built as one-family dwelling
   Duplx    Duplex
   TwnhsE   Townhouse End Unit
   TwnhsI   Townhouse Inside Unit

Factor.

all$BldgType <- as.factor(all$BldgType)

ggplot(all[!is.na(all$SalePrice),], aes(x=BldgType, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.6 House and Roof Style

HouseStyle 1Story One story 1.5Fin One and one-half story: 2nd level finished 1.5Unf One and one-half story: 2nd level unfinished 2Story Two story 2.5Fin Two and one-half story: 2nd level finished 2.5Unf Two and one-half story: 2nd level unfinished SFoyer Split Foyer SLvl Split Level Factor

all$HouseStyle <- as.factor(all$HouseStyle)

ggplot(all[!is.na(all$SalePrice),], aes(x=HouseStyle, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

RoofStyle RoofStyle: Type of roof

   Flat Flat
   Gable    Gable
   Gambrel  Gambrel (Barn)
   Hip  Hip
   Mansard  Mansard
   Shed Shed

Factor.

all$RoofStyle <- as.factor(all$RoofStyle)

ggplot(all[!is.na(all$SalePrice),], aes(x=RoofStyle, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

RoofMatl RoofMatl: Roof material

   ClyTile  Clay or Tile
   CompShg  Standard (Composite) Shingle
   Membran  Membrane
   Metal    Metal
   Roll Roll
   Tar&Grv  Gravel & Tar
   WdShake  Wood Shakes
   WdShngl  Wood Shingles

Factor.

all$RoofMatl <- as.factor(all$RoofMatl)

ggplot(all[!is.na(all$SalePrice),], aes(x=RoofMatl, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.7 Exterior

Exterior1st Exterior1st: Exterior covering on house

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast 
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

Factor.

all$Exterior1st <- as.factor(all$Exterior1st)

ggplot(all[!is.na(all$SalePrice),], aes(x=Exterior1st, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
        theme(axis.text.x = element_text(angle = -40, hjust = 0))

Exterior2nd Exterior2nd: Exterior covering on house (if more than one material)

   AsbShng  Asbestos Shingles
   AsphShn  Asphalt Shingles
   BrkComm  Brick Common
   BrkFace  Brick Face
   CBlock   Cinder Block
   CemntBd  Cement Board
   HdBoard  Hard Board
   ImStucc  Imitation Stucco
   MetalSd  Metal Siding
   Other    Other
   Plywood  Plywood
   PreCast  PreCast
   Stone    Stone
   Stucco   Stucco
   VinylSd  Vinyl Siding
   Wd Sdng  Wood Siding
   WdShing  Wood Shingles

Factor.

all$Exterior2nd <- as.factor(all$Exterior2nd)

ggplot(all[!is.na(all$SalePrice),], aes(x=Exterior2nd, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..)) +
        theme(axis.text.x = element_text(angle = -40, hjust = 0))

ExterQual ExterQual: Evaluates the quality of the material on the exterior

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

Ordinal Factor.

all$ExterQual<-as.integer(revalue(all$ExterQual, Qualities))
## The following `from` values were not present in `x`: None, Po
ggplot(all[!is.na(all$SalePrice),], aes(x=ExterQual, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

ExterCond ExterCond: Evaluates the present condition of the material on the exterior

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

Ordinal Factor.

all$ExterCond<-as.integer(revalue(all$ExterCond, Qualities))
## The following `from` values were not present in `x`: None
ggplot(all[!is.na(all$SalePrice),], aes(x=ExterCond, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.8 Foundation

Foundation: Type of foundation

   BrkTil   Brick & Tile
   CBlock   Cinder Block
   PConc    Poured Contrete 
   Slab Slab
   Stone    Stone
   Wood Wood

Factor.

all$Foundation <- as.factor(all$Foundation)

ggplot(all[!is.na(all$SalePrice),], aes(x=Foundation, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.9 Heating and Air

Heating Heating: Type of heating

   Floor    Floor Furnace
   GasA Gas forced warm air furnace
   GasW Gas hot water or steam heat
   Grav Gravity furnace 
   OthW Hot water or steam heat other than gas
   Wall Wall furnace

Factor.

all$Heating <- as.factor(all$Heating)

ggplot(all[!is.na(all$SalePrice),], aes(x=Heating, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

HeatingQC HeatingQC: Heating quality and condition

   Ex   Excellent
   Gd   Good
   TA   Average/Typical
   Fa   Fair
   Po   Poor

Ordinal Factor.

all$HeatingQC<-as.integer(revalue(all$HeatingQC, Qualities))
## The following `from` values were not present in `x`: None
ggplot(all[!is.na(all$SalePrice),], aes(x=HeatingQC, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

CentralAir CentralAir: Central air conditioning

   N    No
   Y    Yes

Ordinal Factor.

all$CentralAir<-as.integer(revalue(all$CentralAir, c('N'=0,'Y'=1)))

ggplot(all[!is.na(all$SalePrice),], aes(x=CentralAir, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

6.3.10 SaleCondition

SaleCondition: Condition of sale

   Normal   Normal Sale
   Abnorml  Abnormal Sale -  trade, foreclosure, short sale
   AdjLand  Adjoining Land Purchase
   Alloca   Allocation - two linked properties with separate deeds, typically condo with a garage unit  
   Family   Sale between family members
   Partial  Home was not completed when last assessed (associated with New Homes)

Factor.

all$SaleCondition <- as.factor(all$SaleCondition)

ggplot(all[!is.na(all$SalePrice),], aes(x=SaleCondition, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))

### Double check character variables

Charcol <- names(all[,sapply(all, is.character)])
Charcol
## character(0)
cat('There are', length(Charcol), 'remaining columns with character values')
## There are 0 remaining columns with character values

6.4 Numeric Factors

A few variables that seem to be numeric at first glance are actually factors or ordinal factors. Let’s fix that.

Numcol <- names(all[,sapply(all, is.numeric)])
Numcol
##  [1] "MSSubClass"    "LotFrontage"   "LotArea"       "Street"       
##  [5] "LotShape"      "OverallQual"   "OverallCond"   "YearBuilt"    
##  [9] "YearRemodAdd"  "MasVnrArea"    "ExterQual"     "ExterCond"    
## [13] "BsmtQual"      "BsmtCond"      "BsmtExposure"  "BsmtFinType1" 
## [17] "BsmtFinSF1"    "BsmtFinType2"  "BsmtFinSF2"    "BsmtUnfSF"    
## [21] "TotalBsmtSF"   "HeatingQC"     "CentralAir"    "X1stFlrSF"    
## [25] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [29] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [33] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [37] "Fireplaces"    "FireplaceQu"   "GarageYrBlt"   "GarageFinish" 
## [41] "GarageCars"    "GarageArea"    "GarageQual"    "GarageCond"   
## [45] "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch"
## [49] "X3SsnPorch"    "ScreenPorch"   "PoolArea"      "PoolQC"       
## [53] "MiscVal"       "MoSold"        "YrSold"        "SalePrice"

6.4.1 MoSold: Month Sold (MM)

The month a house is sold is a factor. Houses do not get more expensive in December (month 12) compared to January (month 1), or vice versa.

all$MoSold <- as.factor(all$MoSold)

ggplot(all[!is.na(all$MoSold),], aes(x=MoSold, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

### YrSold: Year Sold (YYYY)

Same situation for year sold. Factor. Generally, house prices go up, but the relationship is not completely clear. As this dataset only contains houses sold within a time span of five years which also includes a major financial crisis, factor makes the most sence in this scenario.

all$YrSold <- as.factor(all$YrSold)

ggplot(all[!is.na(all$YrSold),], aes(x=YrSold, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

### MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES
   

MSSubClass is coded with a number but is really a factor.

all$MSSubClass <- as.factor(all$MSSubClass)

ggplot(all[!is.na(all$MSSubClass),], aes(x=MSSubClass, y=SalePrice)) +
        geom_bar(stat='summary', fun=median, fill='blue') +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..))
## Warning: Removed 1459 rows containing non-finite values (`stat_summary()`).

7 Feature Engineering

7.1 Bathrooms

There are four bathroom variables. By themselves they are not very predictive but I suspect if they were combined they could add value.

There are full baths and half baths. I created a Total Bathrooms variable which adds the number of bathrooms together, with half baths receiving a value of 0.5 and full baths receiving a value of 1.0.

BsmtFullBath, BsmtHalfBath, FullBath, HalfBath

all$TotalBathrooms <- all$BsmtFullBath + all$BsmtHalfBath*(0.5) + all$FullBath + all$HalfBath*(0.5)
tb1 <- ggplot(data=all[!is.na(all$SalePrice),], aes(x=as.factor(TotalBathrooms), y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
tb2 <- ggplot(data=all, aes(x=as.factor(TotalBathrooms))) +
        geom_histogram(stat='count')
grid.arrange(tb1, tb2)
## `geom_smooth()` using formula = 'y ~ x'

7.2 House Age

Let’s create a variable for how old the house is at the time of sale. We could simply subtract YearBuilt from YrSold, but I would like to incorporate when houses have been remodeled. If a house is remodeled, then I set HouseAge equal to YrSold - YearRemodAdd and if the house was not remodeled then I set HouseAge equal to YrSold - YearBuilt

# Change to numeric so we can do math
all$YrSold <- as.numeric(as.character(all$YrSold))
all$YearBuilt <- as.numeric(as.character(all$YearBuilt))
all$YearRemodAdd <- as.numeric(as.character(all$YearRemodAdd))

all <- all %>%
  mutate(HouseAge = case_when(
      YearRemodAdd == YearBuilt ~ YrSold - YearBuilt,
      .default = YrSold - YearRemodAdd
    ))

7.3 Remodeled Yes/No Flag

To go along with some houses being remodeled, it will be beneficial to have a variable that flags which houses have been remodeled.

all <- all %>%
  mutate(RemodeledFlag = case_when(
      YearRemodAdd == YearBuilt ~ 'No',
      .default = 'Yes'
    ))
all$RemodeledFlag <- as.factor(all$RemodeledFlag)
# Change back to factor
all$YrSold <- as.factor(all$YrSold)

7.4 Neighborhood

nb1 <- ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "median", fill='blue') + labs(x='Neighborhood', y='Median SalePrice') +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
## Warning in geom_bar(stat = "summary", fun.y = "median", fill = "blue"):
## Ignoring unknown parameters: `fun.y`
nb2 <- ggplot(all[!is.na(all$SalePrice),], aes(x=reorder(Neighborhood, SalePrice, FUN=mean), y=SalePrice)) +
        geom_bar(stat='summary', fun.y = "mean", fill='blue') + labs(x='Neighborhood', y="Mean SalePrice") +
        theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=50000), labels = comma) +
        geom_label(stat = "count", aes(label = ..count.., y = ..count..), size=3) +
        geom_hline(yintercept=163000, linetype="dashed", color = "red") #dashed line is median SalePrice
## Warning in geom_bar(stat = "summary", fun.y = "mean", fill = "blue"): Ignoring
## unknown parameters: `fun.y`
grid.arrange(nb1, nb2)
## No summary function supplied, defaulting to `mean_se()`
## No summary function supplied, defaulting to `mean_se()`

There are three neighborhoods that clearly have a higher mean and median than the other neighborhoods. This is mirrored at the bottom of the range with the three poorest neighborhoods. I will bin these three into an ordinal factor.

all$RichPoor[all$Neighborhood %in% c('StoneBr', 'NridgHt', 'NoRidge')] <- 2
all$RichPoor[!all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale', 'StoneBr', 'NridgHt', 'NoRidge')] <- 1
all$RichPoor[all$Neighborhood %in% c('MeadowV', 'IDOTRR', 'BrDale')] <- 0

table(all$RichPoor)
## 
##    0    1    2 
##  160 2471  288

7.5 Total Size

There is no variable for the total square footage of the house. Let’s create one which adds up the above ground and below ground square footage.

all$TotalSqFeet <- all$GrLivArea + all$TotalBsmtSF

ggplot(data=all[!is.na(all$SalePrice),], aes(x=TotalSqFeet, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma) +
        geom_text_repel(aes(label = ifelse(all$GrLivArea[!is.na(all$SalePrice)]>4500, rownames(all), '')))
## `geom_smooth()` using formula = 'y ~ x'

There’s two very clear outliers. I will deal with this in the next section.

7.6 Porches and Decks

All of the porch and deck variables are split up and won’t contribute much to the model by themselves, although I suspect they could add value if they were all added together.

all$DeckPorch <- all$WoodDeckSF + all$OpenPorchSF + all$EnclosedPorch + all$X3SsnPorch + all$ScreenPorch

ggplot(data=all[!is.na(all$SalePrice),], aes(x=DeckPorch, y=SalePrice))+
        geom_point(col='blue') + geom_smooth(method = "lm", se=FALSE, color="black", aes(group=1)) +
        scale_y_continuous(breaks= seq(0, 800000, by=100000), labels = comma)
## `geom_smooth()` using formula = 'y ~ x'

8 Prepare for Modeling

8.1 Dropping co-linear variables

Some of the variables in the dataset are co-linear. I am dropping a variable if two variables are highly correlated. To find these correlated pairs, I have used the correlations matrix again (see section 6.1). For instance: GarageCars and GarageArea have a correlation of 0.89. Of those two, I am dropping the variable with the lowest correlation with SalePrice (which is GarageArea with a SalePrice correlation of 0.62. GarageCars has a SalePrice correlation of 0.64).

dropVars <- c('YearRemodAdd', 'GarageYrBlt', 'GarageArea', 'GarageCond', 'TotalBsmtSF', 'TotalRmsAbvGrd', 'BsmtFinSF1')

all <- all[,!(names(all) %in% dropVars)]

8.2 Removing Outliers

There are two huge houses that didn’t sell for a lot of money. I will remove them from the dataset.

all <- all[-c(524, 1299),]

8.3 Normalize Variables

Before modeling I need to center and scale the ‘true numeric’ predictors (so not variables that have been label encoded), and create dummy variables for the categorical predictors. Below, I am splitting the dataframe into one with all (true) numeric variables, and another dataframe holding the (ordinal) factors.

numericVarNames <- numericVarNames[!(numericVarNames %in% c('MSSubClass', 'MoSold', 'YrSold', 'SalePrice', 'OverallQual', 'OverallCond'))] #numericVarNames was created before having done anything
numericVarNames <- append(numericVarNames, c('HouseAge', 'DeckPorch', 'TotalBathrooms', 'TotalSqFeet'))

DFnumeric <- all[, names(all) %in% numericVarNames]

DFfactors <- all[, !(names(all) %in% numericVarNames)]
DFfactors <- DFfactors[, names(DFfactors) != 'SalePrice']

cat('There are', length(DFnumeric), 'numeric variables, and', length(DFfactors), 'factor variables')
## There are 30 numeric variables, and 49 factor variables
# str(DFnumeric)

Skewness Skewness is a measure of the symmetry in a distribution. A symmetrical dataset will have a skewness equal to 0. So, a normal distribution will have a skewness of 0. Skewness essentially measures the relative size of the two tails. As a rule of thumb, skewness should be between -1 and 1. In this range, data are considered fairly symmetrical. In order to fix the skewness, I am taking the log for all numeric predictors with an absolute skew greater than 0.8 (actually: log+1, to avoid division by zero issues).

for(i in 1:ncol(DFnumeric)){
        if (abs(skew(DFnumeric[,i]))>0.8){
                DFnumeric[,i] <- log(DFnumeric[,i] +1)
        }
}

Normalizing the data

PreNum <- preProcess(DFnumeric, method=c("center", "scale"))
print(PreNum)
## Created from 2917 samples and 30 variables
## 
## Pre-processing:
##   - centered (30)
##   - ignored (0)
##   - scaled (30)
DFnorm <- predict(PreNum, DFnumeric)
dim(DFnorm)
## [1] 2917   30

8.4 Create Dummy Variables

In order to change the factor variables into numerical variables so the predictive model works correctly, I will create ‘Dummy’ variables for each factor.

DFdummies <- as.data.frame(model.matrix(~.-1, DFfactors))
dim(DFdummies)
## [1] 2917  204

8.5 Remove variables with no/few Observations

We should remove variables with no or few observations that will have little to no affect on the model.

#check if some values are absent in the test set
ZerocolTest <- which(colSums(DFdummies[(nrow(all[!is.na(all$SalePrice),])+1):nrow(all),])==0)
colnames(DFdummies[ZerocolTest])
##  [1] "UtilitiesNoSeWa"    "Condition2RRAe"     "Condition2RRAn"    
##  [4] "Condition2RRNn"     "HouseStyle2.5Fin"   "RoofMatlMembran"   
##  [7] "RoofMatlMetal"      "RoofMatlRoll"       "Exterior1stImStucc"
## [10] "Exterior1stStone"   "Exterior2ndOther"   "HeatingOthW"       
## [13] "ElectricalMix"      "MiscFeatureTenC"
DFdummies <- DFdummies[,-ZerocolTest] #removing predictors
#check if some values are absent in the train set
ZerocolTrain <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])==0)
colnames(DFdummies[ZerocolTrain])
## [1] "MSSubClass150"
DFdummies <- DFdummies[,-ZerocolTrain] #removing predictor

Also taking out variables with less than 10 ‘ones’ in the train set.

fewOnes <- which(colSums(DFdummies[1:nrow(all[!is.na(all$SalePrice),]),])<10)
colnames(DFdummies[fewOnes])
##  [1] "MSSubClass40"         "LotConfigFR3"         "NeighborhoodBlueste" 
##  [4] "NeighborhoodNPkVill"  "Condition1PosA"       "Condition1RRNe"      
##  [7] "Condition1RRNn"       "Condition2Feedr"      "Condition2PosA"      
## [10] "Condition2PosN"       "RoofStyleMansard"     "RoofStyleShed"       
## [13] "RoofMatlWdShake"      "RoofMatlWdShngl"      "Exterior1stAsphShn"  
## [16] "Exterior1stBrkComm"   "Exterior1stCBlock"    "Exterior2ndAsphShn"  
## [19] "Exterior2ndBrk Cmn"   "Exterior2ndCBlock"    "Exterior2ndStone"    
## [22] "FoundationStone"      "FoundationWood"       "HeatingGrav"         
## [25] "HeatingWall"          "ElectricalFuseP"      "GarageTypeCarPort"   
## [28] "MiscFeatureOthr"      "SaleTypeCon"          "SaleTypeConLD"       
## [31] "SaleTypeConLI"        "SaleTypeConLw"        "SaleTypeCWD"         
## [34] "SaleTypeOth"          "SaleConditionAdjLand"
DFdummies <- DFdummies[,-fewOnes] #removing predictors
dim(DFdummies)
## [1] 2917  154

Altogether, I have removed 49 one-hot encoded predictors with little or no variance.

combined <- cbind(DFnorm, DFdummies) #combining all (now numeric) predictors into one dataframe 

8.6 SalePrice Skewness

skew(all$SalePrice)
## [1] 1.877427
qqnorm(all$SalePrice)
qqline(all$SalePrice)

The skew of 1.87 indicates a right skew that is too high, and the Q-Q plot shows that sale prices are also not normally distributed. To fix this I am taking the log of SalePrice.

all$SalePrice <- log(all$SalePrice) #default is the natural logarithm, "+1" is not necessary as there are no 0's
skew(all$SalePrice)
## [1] 0.1213182

As you can see,the skew is now quite low and the Q-Q plot is also looking much better.

qqnorm(all$SalePrice)
qqline(all$SalePrice)

8.7 Create Train/Test

train1 <- combined[!is.na(all$SalePrice),]
test1 <- combined[is.na(all$SalePrice),]

9 Modeling - XGBoost

xgb_grid = expand.grid(
nrounds = 1000,
eta = c(0.1, 0.05, 0.01),
max_depth = c(2, 3, 4, 5, 6),
gamma = 0,
colsample_bytree=1,
min_child_weight=c(1, 2, 3, 4 ,5),
subsample=1
)

The next step is to let caret find the best hyperparameter values (using 5 fold cross validation).

# set.seed(7)
# my_control <-trainControl(method="cv", number=5)
# 
# xgb_caret <- train(x=train1, y=all$SalePrice[!is.na(all$SalePrice)], method='xgbTree', trControl= my_control, tuneGrid=xgb_grid) 
# xgb_caret$bestTune

As expected, this took quite a bit of time (locally). In case you are running yourself I disabled the code, and am just continuing with the results. According to caret, the ‘bestTune’ parameters are:

Below, I am starting with the preparation of the data in the recommended format.

label_train <- all$SalePrice[!is.na(all$SalePrice)]

# put our testing & training data into two seperate Dmatrixs objects
dtrain <- xgb.DMatrix(data = as.matrix(train1), label= label_train)
dtest <- xgb.DMatrix(data = as.matrix(test1))

In addition, I am taking over the best tuned values from the caret cross validation.

default_param<-list(
        objective = "reg:linear",
        booster = "gbtree",
        eta=0.05, #default = 0.3
        gamma=0,
        max_depth=2, #default=6
        min_child_weight=3, #default=1
        subsample=1,
        colsample_bytree=1
)

The next step is to do cross validation to determine the best number of rounds (for the given set of parameters).

xgbcv <- xgb.cv( params = default_param, data = dtrain, nrounds = 500, nfold = 5, showsd = T, stratified = T, print_every_n = 40, early_stopping_rounds = 10, maximize = F)
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [16:53:12] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:10.955590+0.004262   test-rmse:10.955574+0.018177 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 10 rounds.
## 
## [41] train-rmse:1.428353+0.000684    test-rmse:1.429110+0.010132 
## [81] train-rmse:0.228330+0.000776    test-rmse:0.235843+0.007827 
## [121]    train-rmse:0.119193+0.001264    test-rmse:0.137376+0.008352 
## [161]    train-rmse:0.107103+0.001347    test-rmse:0.129163+0.008222 
## [201]    train-rmse:0.101120+0.001198    test-rmse:0.125610+0.007949 
## [241]    train-rmse:0.096869+0.001145    test-rmse:0.123302+0.007657 
## [281]    train-rmse:0.093573+0.001134    test-rmse:0.121817+0.007461 
## [321]    train-rmse:0.090794+0.001161    test-rmse:0.121037+0.007093 
## [361]    train-rmse:0.088418+0.001041    test-rmse:0.120602+0.006958 
## [401]    train-rmse:0.086309+0.001068    test-rmse:0.120194+0.006763 
## [441]    train-rmse:0.084547+0.001029    test-rmse:0.119738+0.006478 
## Stopping. Best iteration:
## [448]    train-rmse:0.084219+0.001038    test-rmse:0.119689+0.006504

Although it was a bit of work, the hyperparameter tuning definitly paid of, as the cross validated RMSE improved considerably (from 0.1225 without the caret tuning, to 0.1177 in this version)!

#train the model using the best iteration found by cross validation
xgb_mod <- xgb.train(data = dtrain, params=default_param, nrounds = 475)
## [16:53:17] WARNING: src/objective/regression_obj.cu:213: reg:linear is now deprecated in favor of reg:squarederror.
XGBpred <- predict(xgb_mod, dtest)
predictions_XGB <- exp(XGBpred) #need to reverse the log to the real values
head(predictions_XGB)
## [1] 118689.8 165311.7 185160.2 188604.0 192461.5 165139.0
#view variable importance plot
library(Ckmeans.1d.dp) #required for ggplot clustering
mat <- xgb.importance (feature_names = colnames(train1),model = xgb_mod)
xgb.ggplot.importance(importance_matrix = mat[1:20], rel_to_first = TRUE)

10 Final Predictions

sub_avg <- data.frame(Id = test_labels, SalePrice = round(predictions_XGB, 3))
head(sub_avg)
##     Id SalePrice
## 1 1461  118689.8
## 2 1462  165311.7
## 3 1463  185160.2
## 4 1464  188604.0
## 5 1465  192461.5
## 6 1466  165139.0
write.csv(sub_avg, file = 'final_predictions.csv', row.names = F)